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\ Abstract 

| Quantile regression is a method to estimate the quantiles of the conditional distribution of 

£SJ ■ a response variable, and as such it permits a much more accurate portrayal of the relationship 

between the response variable and observed covariates than methods such as Least-squares 
' or Least Absolute Deviations regression. It can be expressed as a linear program, and, with 

appropriate preprocessing, interior-point methods can be used to find a solution for moderately 
large problems. Dealing with very large problems, e.g., involving data up to and beyond the 
terabyte regime, remains a challenge. Here, we present a randomized algorithm that runs in 
nearly linear time in the size of the input and that, with constant probability, computes a (1 + e) 
| approximate solution to an arbitrary quantile regression problem. As a key step, our algorithm 

^1 ■ computes a low-distortion subspace-preserving embedding with respect to the loss function of 

quantile regression. Our empirical evaluation illustrates that our algorithm is competitive with 
■ the best previous work on small to medium-sized problems, and that in addition it can be 

implemented in MapReduce-like environments and applied to terabyte-sized problems. 

! 1 Introduction 

00 , 

Quantile regression is a method to estimate the quantiles of the conditional distribution of a response 
variable, expressed as functions of observed covariates [7j, in a manner analogous to the way in 
which Least-squares regression estimates the conditional mean. The Least Absolute Deviations 
, regression (i.e., t\ regression) is a special case of quantile regression that involves computing the 

median of the conditional distribution. In contrast with l\ regression and the more popular £2 or 
Least-squares regression, quantile regression involves minimizing asymmetrically-weighted absolute 
residuals. Doing so, however, permits a much more accurate portrayal of the relationship between 
the response variable and observed covariates, and it is more appropriate in certain non-Gaussian 
settings. For these reasons, quantile regression has found applications in many areas, e.g., survival 
analysis and economics [HOE]. As with l\ regression, the quantile regression problem can be 
formulated as a linear programming problem, and thus simplex or interior-point methods can be 
applied [HJ [T31 03] . Most of these methods are efficient only for problems of small to moderate 
size, and thus to solve very large-scale quantile regression problems more reliably and efficiently, 
we need new computational techniques. 

In this paper, we provide a fast algorithm to compute a (1 + e) relative-error approximate 
solution to the over-constrained quantile regression problem. Our algorithm constructs a low- 
distortion subspace embedding of the form that has been used in recent developments in randomized 
algorithms for matrices and large-scale data problems, and our algorithm runs in time that is nearly 
linear in the number of nonzeros in the input data. 
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In more detail, recall that a quantile regression problem can be specified by a (design) matrix 
A £ M nxd , a (response) vector b E R n , and a parameter r £ (0, 1), in which case the quantile 
regression problem can be solved via the optimization problem 

minimize^gjgd p T (b — Ax), (1) 

where p T {x) = Ylt=i Pr{xi), for x E K d , where 

/ \ J tz, * > 0; 

[(r-l)z, z < 0, 

for z E R, is the corresponding loss function. In the remainder of this paper, we will use A to denote 
the augmented matrix [i> — A\, and we will consider A £ M nxrf . With this notation, the quantile 
regression problem of ([1]) can equivalently be expressed as a constrained optimization problem with 
a single linear constraint, 

minimize x . g c Pt(Ax), (3) 

where C = {x £ M d \ c T x = 1} and c is a unit vector with the first coordinate set to be 1. We will 
focus on very over-constrained problems with size n> d. 

Our main algorithm depends on a technical result, presented as Lemma O which is of indepen- 
dent interest. Let A £ W ixd be an input matrix, and let S £ M. sxn be a random sampling matrix 
constructed based on the following importance sampling probabilities, 

Pi = min{l,s • ||f7(i)||i/||[/||i}, 

where || • ||i is the element- wise £i-norm, and where Um is the i-th row of an l\ well-conditioned 
basis U for the range of A (see Definition [3] and Lemma [TO]) . Then, Lemma [9] states that, for a 
sampling complexity s that depends on d but is independent of n, 

(1 - e)p T (Ax) < p T (SAx) < (1 + e)p T (Ax) 

will be satisfied for every x £ M. d . 

Although one could use, e.g., the algorithm of [6] to compute such a well-conditioned basis U and 
then "read off" the £i-norm of the rows of U, doing so would be much slower than the time allotted 
by our main algorithm. Thus, to apply the ideas from Lemma [9] for fast quantile regression, we 
provide two sets of additional results. First, we describe three algorithms (Algorithm^ Algorithm^ 
and Algorithm [3]) for computing an implicit representation of a well-conditioned basis; and second, 
we describe an algorithm (Algorithm S]) for approximating the £i-norm of the rows of the well- 
conditioned basis from that implicit representation. For each of these algorithms, we prove quality- 
of-approximation bounds, and we show that they run in nearly "input-sparsity" time, i.e., in 
0(nnz(A) ■ logn) time, where nnz(A) is the number of nonzero elements of A, plus lower-order 
terms. These lower-order terms depend on the time to solve the subproblem we construct; and 
they depend on the smaller dimension d but not on the larger dimension n. Although of less 
interest in theory, these lower-order terms can be important in practice, as our empirical evaluation 
will demonstrate. 

We should note that, of the three algorithms for computing a well-conditioned basis, the first 
two appear in [12] and are stated here for completeness; and the third algorithm, which is new 
to this paper, is not uniformly better than either of the two previous algorithms with respect 
to either condition number or the running time. (In particular, Algorithm [1] has slightly better 
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running time, and Algorithm [2] has slightly better conditioning properties.) Our new conditioning 
algorithm is, however, only slightly worse than the better of the two previous algorithms with 
respect to each of those two measures. Because of the tradeoffs involved in implementing quantile 
regression algorithms in practical settings, our empirical results show that by using a conditioning 
algorithm that is only slightly worse than the best previous conditioning algorithms for each of 
these two criteria, our new conditioning algorithm can lead to better results than either of the 
previous algorithms that was superior by only one of those criteria. 

Given these results, our main algorithm for quantile regression is presented as Algorithm [5j Our 
main theorem for this algorithm, Theorem [H states that, with constant probability, this algorithm 
returns a (1 + e)-approximate solution to the quantile regression problem; and that this solution 
can be obtained in 0(nnz(yl) • logra) time, plus the time for solving the subproblem (whose size is 
0(/wi 3 log(/u/e)/e 2 ) x d, where fi = j^—, independent of n, when r 6 [1/2, 1)). 

We also provide a detailed empirical evaluation of our main algorithm for quantile regression, 
including characterizing the quality of the solution as well as the running time, as a function of the 
high dimension n, the lower dimension d, the sampling complexity s, and the quantile parameter 
r. Among other things, our empirical evaluation demonstrates that the output of our algorithm is 
2-digit accurate, in terms of both objective function value and actual solution quality (by the latter, 
we mean a norm of the difference between the exact solution to the full problem and the solution to 
the subproblem constructed by our algorithm), to the exact quantile regression by sampling only, 
e.g., about 0.001% of a problem with size lelOx 100 Our new conditioning algorithm outperforms 
other conditioning-based methods, and it permits much larger small dimension d than previous 
conditioning algorithms. In addition to evaluating our algorithm on moderately-large data that can 
fit in RAM, we also show that our algorithm can be implemented in MapReduce-like environments 
and applied to computing the solution of terabyte-sized quantile regression problems. 

The best previous algorithm for moderately large quantile regression problems is due to [H] and 
|13j . Their algorithm uses an interior-point method on a smaller problem that has been preprocessed 
by randomly sampling a subset of the data. Their preprocessing step involves predicting the sign 
of each A^x* — bi, where Au\ and bi are the i-ih row of the input matrix and the i-th element 
of the response vector, respectively, and x* is an optimal solution to the original problem. When 
compared with our approach, they compute an optimal solution, while we compute an approximate 
solution; but we provide worst-case analysis that with high probability our algorithm is guaranteed 
to work, while they do not. Also, the sampling complexity of their algorithm depends on the higher 
dimension n, while the number of samples required by our algorithm depends only on the lower 
dimension d; but our sampling is with respect to a carefully-constructed nonuniform distribution, 
while they sample uniformly at random. 

For a detailed overview of recent work on using randomized algorithms to compute approximate 
solutions for least-squares regression and related problems, see the recent review [11]. Most relevant 
for our work is the algorithm of [6] that constructs a well-conditioned basis by ellipsoid rounding and 
a subspace-preserving sampling matrix in order to approximate the solution of general £ p regression 
problems, for p G [1, oo), in roughly 0(nd 5 log re); the algorithms of [15] and [5] that use the "slow" 
and "fast' versions of Cauchy Transform to obtained a low-distortion t\ embedding matrix and 
solve the over-constrained l\ regression problem in 0{nd l ^ 7 ® + ) and 0(nd log n) time, respectively; 
and the algorithm of [12] that constructs low-distortion embeddings in "input sparsity" time and 
uses those embeddings to approximate the solution of the over-constrained l\ regression problem 
in C(nnz(^4) • logn + poly (d) log (l/e)/e 2 ) time. In particular, we will use the two conditioning 
methods in [12], as well as our "improvement" of those two methods, for constructing £i-norm 

1 We use this notation throughout; emphe.g., by lelO x 10, we mean that n = 1 X 10 10 and d = 50. 
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well-conditioned basis matrices in nearly input-sparsity time. 

2 Preliminaries 

We use || • ||i to denote the element-wise i\ norm for both vectors and matrices; and we use [n] 
to denote the set {1,2, ... ,n}. For any matrix A, and A& denote the i-th row and the j-th 
column of A, respectively; and A denotes the column space of A. For simplicity, we assume A has 
full column rank; and we always assume that r > ^. All the results hold for r < ^ by simply 
switching the positions of r and 1 — r. 

Although p T (-), defined in Eqn. [21 is not a norm, since the loss function does not have the 
positive linearity, it satisfies some "good" properties, as stated in the following lemma: 

Lemma 1. Suppose that r > ^. Then, for any x,y G M. d ,a > 0, the following hold: 

1. p T (x + y) < p T (x) + p T (y); 

2. (1 -t)||x||i < p T (x) < t\\x\\ i; 

3. p T {ax) = ap T (x); and 

4- \Pt(x) - p T (y)\ < r\\x - y\\±. 

Proof. It is trivial to prove every equality or inequality for x, y in one dimension. Then by the 
definition of p T (-) for vectors, the inequalities and equalities hold for general x and y. □ 

To make our subsequent presentation self-contained, here we will provide a brief review of 
recent work on subspace embedding algorithms. We start with the definition of a low-distortion 
embedding matrix for A in terms of || • ||i [12] . 

Definition 1 (Low-distortion l\ Subspace Embedding). Given A £ W nxd , U G W xn is a low- 
distortion embedding of A if r = poly(<i) and for all x £ M. d , 

(1/ki)||Ae||i < ||IL4aj||i < k 2 ||^||i- 
where k\ and K2 are low-degree polynomials of d. 

The following stronger notion of a (1 ± e)-distortion subspace-preserving embedding will be crucial 
for our method. In this paper, the "measure functions" we will consider are || • ||i and p r { )- 

Definition 2 ((l±e)-distortion Subspace-preserving Embedding). Given A G M. nxd and a measure 
function of vectors /(•), S G ]R sxn is a (1 ± e)-distortion subspace-preserving matrix of (A, /(•)) if 
s = poly(<i) and for all x G M. d , 

(1 - e)f(Ax) < f(SAx) < (1 + e)f(Ax). 

Furthermore, if S is a sampling matrix (one nonzero element per row in S), we call it a (1 ± e)- 
distortion subspace-preserving sampling matrix. 

In addition, the following notion, originally introduced by [6], of a basis that is well-conditioned for 
the l\ norm will also be crucial for our method. 
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Definition 3 ((a, /^-conditioning and well-conditioned basis). Given A G M. nxd , A is (a, (3)- 
conditioned if \\A\\i < a and for all x G M q , \\x\\oo < /3\\Ax\\i. Define k(A) as the minimum value 
of a {3 such that A is (a, f3)- conditioned. We will say that a basis U of A is a well-conditioned basis 
if k = k(U) is a low-degree polynomial in d, independent of n. 

For a low-distortion embedding matrix for (A, || • ||i), we next state a fast construction algorithm 
that runs in "input-sparsity" time by applying the Sparse Cauchy Transform. This was originally 
proposed as Theorem 2 in [12] . 

Lemma 2 (Fast construction of Low-distortion l\ Subspace Embedding Matrix, from [12]). Given 
A G M nxd with full column rank, let LIi = SC G M riXn , where S G W lXn has each column chosen 
independently and uniformly from the r\ standard basis vector of W 1 , and where C G R nxn is a 
diagonal matrix with diagonals chosen independently from Cauchy distribution. Set n = ud 5 log 5 d 
with u) sufficiently large. Then, with a constant probability, we have 

l/0(d 2 log 2 d)-\\Ax\\ 1 <\\U 1 Ax\\ 1 <0(dlogd)-\\Ax\\ 1 , Vx G R d . (4) 

In addition, Ii\A can be computed in 0(imz(A)) time. 

Next, we state a result for the fast construction of a (l±e)-distortion subspace-preserving sampling 
matrix for (^4, || • ||i), from Theorem 5.4 in [5], with p = 1, as follows. 

Lemma 3 (Fast construction of l\ Sampling Matrix, from Theorem 5.4 in [5]). Given a matrix 
A G M. nxd and a matrix R G M. dxd such that AR' 1 is a well- conditioned basis for A with condition 
number k, it takes C(nnz( J 4) • logn) time to compute a sampling matrix S G M sxn with s = 
0(Kdlog(l/e)/e 2 ) such that with a constant probability, for any x G 

(1 - e)||Ac||i < ||SAr||i < (1 + e)||Ac||i. 

We also cite the following lemma for finding a matrix R, such that AR~ l is a well-condition basis, 
which is based on ellipsoidal rounding proposed in [5]. 

Lemma 4 (Fast Ellipsoid Rounding, from [5]). Given an n x d matrix A, by applying an ellipsoid 
rounding method, it takes at most 0{nd? log n) time to find a matrix R G M. dxd such that k(AR~ 1 ) < 
2d 2 . 

Finally, two important ingredients for proving subspace preservation are 7-nets and tail inequal- 
ities. Suppose that Z is a point set and || • || is a metric on Z. A subset Z 7 is called as a 7-net for 
some 7 > if for every x £ Z there is a y G Z^ such that ||x — y\\ < 7. It is well-known that the 
unit ball of a d-dimensional subspace has a 7-net with size at most (3/7) d [I]. Also, we will use the 
standard Bernstein inequality to prove concentration results for the sum of independent random 
variables. 

Lemma 5 (Bernstein inequality, [T]). Let X\, . . . ,X n be independent random variables with zero- 
mean. Suppose that \Xi\ < M , for i G [n], then for any positive number t, we have 

t 2 /2 \ 
Hi^X 2 + Mt/2,]- 
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name 


running time 


K 


type 


sc [E] 

Fcm 

Ellipsoid rounding [3] 
Fast ellipsoid rounding [5] 


0(nd 2 logd) 
0(ndlog d) 
0(nd 5 logn)) 
0(nd 3 logn)) 


0(d 5 / 2 \og 3/2 n) 
0(d 7 / 2 log 5 / 2 ?i) 
d 3 l 2 (d + l) 1 / 2 
2d 2 


QR 
QR 
ER 
ER 


spci p2] 


C(nnz(j4) • logn) 


0(cfT log^ d) 


QR 


spc2 m] 


0(nnz(A) ■ logn) + ER_small 


6d 2 


QR+ER 


SPC3 (proposed in this article) 


0(nnz(A) • logn) + QR_small 


19 11 

C?(dT bgx d) 


QR+QR 



Table 1: Summary of running time, condition number, and type of conditioning methods proposed 
recently. QR and ER refer, respectively, to methods based on the QR factorization and methods 
based on Ellipsoid Rounding, as discussed in the text. QFLsmall and EFLsmall denote the running 
time for applying QR factorization and Ellipsoid Rounding, respectively, on a small matrix with 
size independent of n. 



3 Main Theoretical Results 

In this section, we present our main theoretical results on (1 ± e)-distortion subspace-preserving 
embeddings and our fast randomized algorithm for quantile regression. Before presenting these 
results, we start with the theory for conditioning. 

3.1 Several conditioning methods 

The concept of a well-conditioned basis U (recall Definition [3]) plays an important role in our 
algorithms, and thus in this subsection we will discuss several related conditioning methods. By a 
conditioning method, we mean an algorithm for finding, for an input matrix A, a well-conditioned 
basis, i.e., either finding a well-conditioned matrix U or finding a matrix R such that U = AR~ l 
is well-conditioned. There exist many approaches that have been proposed for conditioning. The 
two most important properties of these methods for our subsequent analysis are: (1) the condition 
number k = a(3; and (2) the running time to construct U (or R). The importance of the running 
time should be obvious; but the condition number directly determines the number of rows that we 
need to select, and thus it has an indirect effect on running time (via the time required to solve the 
subproblem). See Table [U for a summary of the basic properties of the conditioning methods that 
will be discussed in this subsection. 

In general, there are three basic ways for finding a matrix R such that U = AR~ l is well- 
conditioned: those based on the QR factorization; those based on Ellipsoid Rounding; and those 
based on combining the two basic methods. 

• Via QR Factorization (QR). To obtain a well-conditioned basis, one can first construct 
a low-distortion t\ embedding matrix. By Definition this means finding a II G W xd , such 
that for any x & M. d , 

(l/ K i)||Ar||i < ||IL4s||i < k 2 \\Ax\\u ( 5 ) 

where r <C n and is independent of n and the factors K\ and k<i here will be low-degree 
polynomials of d (and related to a and (5 of Definition [3]) . For example, II could be Sparse 
Cauchy Transform described in Lemma El After obtaining LT, by calculating a matrix R such 
that II^4i? _1 has orthonormal columns, the matrix AR~ l is a well-conditioned basis with 
K < dy/vKi K2 ■ See Theorem 4.1 in [12] for more details. Here, the matrix R can be obtained 
by a QR factorization (or, alternately, the Singular Value Decomposition). As the choice of 
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II varies, the condition number of AR , i.e., k(AR 1 ), and the corresponding running time 
will also vary, and there is in general a tradeoff among these. 

For simplicity, the acronyms for these types of conditioning methods will come from the name 
of the corresponding transformations: SC stands for Slow Cauchy Transform from [15]; FC 
stands for Fast Cauchy Transform from [5j; and SPC1 (see Algorithm [1]) will be the first 
method based on the Sparse Cauchy Transform (see Lemma [2]) . We will call the methods 
derived from this scheme QR-type methods. 

• Via Ellipsoid Rounding (ER). Alternatively, one can compute a well-conditioned basis 
by applying ellipsoid rounding. This is a deterministic algorithm that computes a 77-rounding 
of a centrally symmetric convex set C = {x £ M d |||j4x||i < 1}. By 77-rounding here we 
mean finding an ellipsoid £ = {x £ M. d \ \\Rx\\2 < 1}, satisfying £ /r/ C C C £, which implies 
|| Rx || 2 < \\Ax\\ 1 < 77 1 1 .Ra; || 2, Vx E M. d . With a transformation of the coordinates, it is not hard 
to show the following, 

1Mb < ||AR _1 :c||i < ??||x|| 2 . (6) 
From this, it is not hard to show the following inequalities, 

HAR^Hi < \\ AR ~ le ih < Yl ^IMI 2 ^ dr li 
je[d\ je[d] 

||AR -1 :r||i > \\x\\2 > 1 1 1 1 oo . 

This directly leads to a well-conditioned matrix U = AR^ 1 with k < drj. Hence, the problem 
boils down to finding a r/-rounding with r\ small in a reasonable time. 

By Theorem 2.4.1 in [TO], one can find a (d(d + l)) 1,/2 -rounding in polynomial time. This 
result was used by [4] and [6]. As we mentioned in the previous section, Lemma [H in [5], a 
new fast ellipsoid rounding algorithm was proposed. For an n x d matrix A with full rank, 
it takes at most 0(nd 3 logn) time to find a matrix R such that AR -1 is a well-conditioned 
basis with k < 2d 2 . We will call the methods derived from this scheme ER-type methods. 

• Via Combined QR+ER Methods. Finally, one can construct a well-conditioned basis by 
combining QR-like and ER-like methods. For example, after we obtain R such that AR^ 1 is 
a well-conditioned basis, as described in LemmaEl one can then construct a (1 ± e)-distortion 
subspace-preserving sampling matrix S in C(nnz( J 4) • logn) time. We may view that the 
price we pay for obtaining S is very low in terms of running time. Since S is a sampling 
matrix with constant distortion factor and since the dimension of SA is independent of n, we 
can apply additional operations on that smaller matrix in order to obtain a better condition 
number, without much additional running time, in theory at least, if n ^> poly(d), for some 
low-degree poly(d). 

Since the bottleneck for ellipsoid rounding is its running time, when compared to QR-type 
methods, one possibility is to apply ellipsoid rounding on SA. Since the bigger dimension 
of SA only depends on d, the running time for computing R via ellipsoid rounding will be 
acceptable if n ^ poly(d). As for the condition number, for any general i\ subspace embed- 
ding II satisfying Eqn. ([5]), i.e., which preserves the t\ norm up to some factor determined by 
d, including S, if we apply ellipsoid rounding on n^4, then the resulting R may still satisfy 
Eqn. © with some rj. In detail, viewing R~ 1 x as a vector in M. d , from Eqn. ([5]), we have 

(l/^llnAR-^Hi < IIAR-^Hi < KiHlIAR-^xlli. 
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Algorithm 1 SPC1: vanilla QR type method with Sparse Cauchy Transform 
Input: A G W ixd with full column rank. 

Output: R- 1 G M. dxd such that AR~ l is a well-conditioned basis with k < Oid 2 log 2 d). 
1: Construct a low-distortion embedding matrix IIi G R riXn of (.4, || • ||i) via Lemma [2j 
2: Compute R G M dxd such that AR' 1 is a well-conditioned basis for A via QR factorization of 
IIiA 



In Eqn. ©, replace ^4 with IIA, combining the inequalities above, we have 

(1/k 2 )|M| 2 < ||AR _1 a;||i < ?7tti|M|2- 

With appropriate scaling, one can show that AR~ l a well-conditioned matrix with k = 
dr\K\K<i- Especially, when S has constant distortion, say (1 ± 1/2), the condition number 
is preserved at sampling complexity 0(d 2 ), while the running time has been reduced a lot, 
when compared to the vanilla ellipsoid rounding method. (See Algorithm [2] (SPC2) below for 
a version of this method.) 

A second possibility is to view 5 as a sampling matrix satisfying Eqn. (JSJ) with II = S. Then, 
according to our discussion of the QR-type methods, if we compute the QR factorization of 
SA, we may expect the resulting AR~ l to be a well-conditioned basis with lower condition 
number k. As for the running time, QR factorization on a smaller matrix will be inconse- 
quential, in theory at least. (See Algorithm [3] (SPC3) below for a version of this method.) 

In the remainder of this subsection, we will describe three related methods for computing a well- 
conditioned basis that we will use in our empirical evaluations. Recall that Table Q] provides a 
summary of these three methods and the other methods that we will use. 

We start with the algorithm obtained when we use Sparse Cauchy Transform from [12] as the 
random projection II in a vanilla QR-type method. We call it SPC1 since we will describe two of 
its variants below. Our main result for Algorithm Q] is given in Lemma El Since the proof is quite 
straightforward, we omit it here. 

Lemma 6. Given A G M. nxd with full rank, Algorithm^ takes 0(nnz(^4) • logn) time to compute 
a matrix R G ~R dxd such that with a constant probability, AR -1 is a well- conditioned basis for A 
with k < 0(d~ log^~ d). 

Next, we summarize the two Combined Methods described above in Algorithm [2] and Algo- 
rithm[3l Since they are variants of SPC1, we call them SPC2 and SPC3, respectively. Algorithm[2] 
originally appeared as first four steps of Algorithm 2 in [T7| . Our main result for Algorithm [2] is 
given in Lemma since the proof of this lemma is very similar to the proof of Theorem 7 in |12] , 
we omit it here. Algorithm [3] is new to this paper. Our main result for Algorithm [3] is given in 
Lemma the proof of the lemma can be found in Appendix [Al 

Lemma 7. Given A G ~R nxd with full rank, Algorithm^ takes 0(nnz(^4) • logn) time to compute 
a matrix R G M. dxd such that with a constant probability, AR -1 is a well- conditioned basis for A 
with k < 6d 2 . 

Lemma 8. Given A G ~R nxd with full rank, Algorithm^ takes 0(miz(A) ■ logn) time to compute 

a matrix R G ~R dxd such that with a constant probability, AR" 1 is a well- conditioned basis for A 

19 11 
with n < 0{d~ \og~i~ d). 
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Algorithm 2 SPC2: QR + ER type method with Sparse Cauchy Transform 
Input: A G M nxd with full column rank. 

Output: R- 1 G M. dxd such that AR~ l is a well-conditioned basis with k < 6d . 
1: Construct a low-distortion embedding matrix IIi G M riXn of (A, \\ ■ ||i) via Lemma [2j 
2: Construct ^ G M dxd such that AfT 1 is a well-conditioned basis for A via QR factorization of 
ILA 

3: Compute a (1 ± l/2)-distortion sampling matrix 5 G ]RP ol y( d ) xn of (A, || • ||i) via Lemma El 
4: Compute R G R rfxd by ellipsoid rounding for SA via Lemma [U 



Algorithm 3 SPC3: QR + QR type method with Sparse Cauchy Transform 
Input: A G MJ lxd with full column rank. 

Output: R- 1 G R dxd such that AR~ l is a well-conditioned basis with k < 0(d ± log 4 d). 
1: Construct a low-distortion embedding matrix IIi G IR r ' lXn of (A, \\ ■ ||i) via Lemma [2j 
2: Construct R G R dxd such that AR -1 is a well-conditioned basis for ^4 via QR factorization of 
UiA. 

3: Compute a (1 ± l/2)-distortion sampling matrix S G W°W d ) xn of (A, \\ ■ \\ i) via Lemma [3l 
4: Compute it! G M dx<i via the QR factorization of SA. 



Both Algorithm [2] and Algorithm [3] have additional steps (Steps 3 & 4), when compared with 
Algorithm [H and this leads to some improvements, at the cost of additional computation time. 
For example, in Algorithm [3] (SPC3), we obtain a well-conditioned basis with smaller k when 
comparing to Algorithm [1] (SPC1). As for the running time, it will be still C(nnz(A) • log re), since 
the additional time is for constructing sampling matrix and solving a QR factorization of a matrix 
whose dimensions are determined by d. Note that when we summarize these results in Table [TJ 
we explicitly list the additional running time for SPC2 and SPC3, in order to show the tradeoff 
between these SPC-derived methods. We will evaluate the performance of all these methods on 
quantile regression problems in Section [4] (except for FC, since it is similar to but worse than SPC1, 
and ellipsoid rounding, since on the full problem it is too expensive). 

Remark. For all the methods we described above, the output is not the well-conditioned matrix 
U, but instead it is the matrix R, the inverse of which transforms A into U. 

Remark. As we can see in Table [lj with respect to conditioning quality, SPC2 has the lowest 
condition number k, followed by SPC3 and then SPC1, which has the worst condition number. On 
the other hand, with respect to running time, SPC1 is the fastest, followed by SPC3, and then 
SPC1, which is the slowest. (The reason for this ordering of the running time is that SPC2 and 
SPC3 need additional steps and ellipsoid rounding takes longer running time that doing a QR 
decomposition. ) 

3.2 Main technical ingredients 

In this subsection, we present the main technical ingredients underlying our main algorithm for 
quantile regression. We start with a result which says that if we sample sufficiently many (but 
still only poly(d)) rows according to an appropriately-defined non-uniform importance sampling 
distribution (of the form given in Eqn. (|7|) below), then we obtain a (1 ± e)-distortion embedding 
matrix with respect to the loss function of quantile regression. Note that the form of this lemma, 
the proof of which may be found in Appendix [Bj is based on ideas from [61 [5]. 
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Lemma 9 (Subspace-preserving Sampling Lemma). Given A G R nxd , let U G M nxd 6e a well- 
conditioned basis for A with condition number k. For s > 0, define 

pi > min{l, s ■ \\U(i) ||i/||^||i}, (7) 

and let S G W ixn be a random diagonal matrix with Sa = 1/pi with probability pi, and otherwise. 
Then when e < 1/2 and 

^ t 27k / / r 18 \ . /4 

^T37— dlog +log U 



with probability at least 1 — 5, for every x G M d , 

(1 - e)p r (Ac) < p T {SAx) < (1 + e)p r (Aa:). (8) 

Remark. It is not hard to see that for any matrix S satisfying ([8]), the rank of A is preserved. 
Remark. Given such a low-distortion subspace-preserving sampling matrix, it is not hard to show 
that, by solving the sub-sampled problem induced by S, i.e., by solving min xG c Pt(SAx), then one 
obtains a (1 + e)/(l — e)-approximate solution to the original problem. For more details, see the 
proof for Theorem [TJ 

In order to apply Lemma[9]to quantile regression, we need to compute the sampling probabilities 
in Eqn. ([7|). This requires two steps: first, find a well-conditioned basis U; and second, compute 
the l\ row norms of U. For the first step, we can apply any method described in the previous 
subsection. (Other methods are possible, but Algorithms [TJ [2j and [3] are of particular interest due 
to their nearly input-sparsity running time. We will now present an algorithm that will perform 
the second step of approximating the l\ row norms of U in the allotted C(nnz(A) • log re) time. 

Suppose we have obtained R" 1 such that AR" 1 is a well-conditioned basis. Consider, next, 
computing pi from U (or from A and i? _1 ), and note that forming U explicitly is expensive both 
when A is dense and when A is sparse. In practice, however, we will not need to form U explicitly, 
and we will not need to compute the exact value of the £i-norm of each row of U. Indeed, it suffices 
to get estimates of ||J7(j)||i, in which case we can adjust the sampling complexity s to maintain a 
small approximation factor. Algorithm [J] provides a way to compute the estimates of the i\ norm 
of each row of U fast and construct the sampling matrix. The same technique was used in [5]. 
Our main result for Algorithm H] is presented in Lemma [10| the proof of which can be found in 
Appendix [Cl 

Lemma 10 (Fast Construction of (1 ± e)-distortion Sampling Matrix). Given a matrix A G M nxd , 
and a matrix R G ~R dxd such that AR" 1 is a well- conditioned basis for A with condition number k, 
Algorithm^ takes C(nnz(^4) • logn) time to compute a sampling matrix S G M sxn (with only one 
nonzero per row), such that with probability at least 0.9, S is a (1 ± e)-distortion sampling matrix. 
That is for all x G M d , 

(1 - e)p T {Ax) < p T (SAx) < (1 + e)p T (Ax). (9) 
Further, with probability at least 1 — o(l), s = O (/iAtdlog (/x/e) /e 2 ), where p = 

Remark. In the text before Lemma [101 s denotes an input parameter for defining the importance 
sampling probabilities. However, the actual sample size might be less than that. Since Lemma [TOl 
is about the construction of the sampling matrix S, we let s denote the actual number of row 
selected, which is the higher dimension of S. Also, as stated, the output of Algorithm [3] is a n x n 
matrix; but if we zero-out the all-zero rows, then the actual size of S is indeed s by d as described 
in Lemma [TOl 
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Algorithm 4 Fast Construction of (1 ± e)-distortion Sampling Matrix of (A,p T (-)) 

Input: A G M. nxd ,R G IR dxa! such that AR -1 is well-conditioned with condition number k, e G 

(0,1/2), re [1/2,1). 
Output: Sampling matrix S E R nxn . 

1: Let II2 G R rfxr " 2 be a matrix of independent Cauchys with r 2 = 151og(40n). 

2: Compute R~ l Yi 2 and construct A = AR~ l U 2 G M nxr2 . 

3: For i G [n], compute Aj = medianj e [ r2 ]|Ajj|. 

4: For s = jr7^ (rflog ( yt^^) + log 80 J and i G [n], compute probabilities 



Pi = min < 1, s 



5: Let S G M nxn be diagonal with independent entries 

5 with P rob ability j^; 

I 0, with probability 1 — pi. 



Algorithm 5 Fast Randomized Algorithm for Quantile Regression 

Input: A G R nxd with full column rank, e G (0, 1/2), r G [1/2, 1). 

Output: An approximated solution x G M d to problem minimize^gc Pr(Ax). 
1: Compute G R dxd such that AR~ l is a well-conditioned basis for A via Algorithm [H [2j or [3l 
2: Compute a (1 ± e)-distortion embedding S G M sxn of (A, p T (-)) via Algorithm HI 
3: Return x G K rf that minimizes p T (SAx) with respect to x G C. 



3.3 Main algorithm 

In this subsection, we state our main algorithm for computing an approximate solution to the 
quantile regression problem. Recall that, to compute a relative-error approximate solution, it 
suffices to compute a (1 ±e)-distortion sampling matrix S. To construct S, we first compute a well- 
conditioned basis U with Algorithm [H [21 or [3] (or some other conditioning method) , and then we 
apply Algorithm H] to approximate the l\ norm of each row of U. These procedures are summarized 
in Algorithm The main quality-of-approximation result for this algorithm by using Algorithm [2] 
is stated in Theorem [TJ the proof of which can be found in Appendix [Pi 

Theorem 1 (Fast Quantile Regression). Given A G W nxd and e G (0,1/2), if Algorithm^ is used 
in Step 1, Algorithm^ returns a vector x that, with probability at least 0.8, satisfies 

Pr(Ax)< fl±i\ Pr (Ax*), 

where x* is an optimal solution to the original problem. In addition, the algorithm to construct x 
runs in time 

0{nnz{A) ■ log n) + (0(pd 3 log(p/e)/e 2 ), d) , 

where /i = and 4>(s, d) is the time to solve a quantile regression problem of size s x d. 

Remark. As stated, Theorem Q] uses Algorithm [2] in Step 3; we did this since it leads to the best 
known running-time results in worst-case analysis, but our empirical results will indicate that due 
to various tradeoffs the situation is more complex in practice. 
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Remark. Our theory provides a bound on the solution quality, as measured by the objective 
function of the quantile regression problem, and it does not provide bounds for the difference 
between the exact solution vector and the solution vector returned by our algorithm. We will, 
however, compute this latter quantity in our empirical evaluation. 



4 Empirical Evaluation on Medium-scale Quantile Regression 

In this section and the next section, we present our main empirical results. We have evaluated an 
implementation of Algorithm [5] using several different conditioning methods in Step 1. We have 
considered both simulated data and real data, and we have considered both medium-sized data as 
well as terabyte-scale data. In this section, we will summarize our results for medium-sized data. 
The results on terabyte-scale data can be found in Section [SJ 



Simulated skewed data. For the synthetic data, in order to increase the difficulty for sam- 
pling, we will add imbalanced measurements to each coordinates of the solution vector. A similar 
construction for the test data was appeared in [5]. Due to the skewed structure of the data, we 
will call this data set "skewed data" in the following discussion. This data set is generated in the 
following way. 

1. Each row of the design matrix A is a canonical vector. Suppose the number of measurements 
on the j-th column are Cj, where Cj = qcj-i, for j = 2, . . . , d. Here 1 < q < 2. A is a n x d 
matrix. 

2. The true vector x* with length d is a vector with independent Gaussian entries. Let b* = Ax* . 

3. The noise vector e is generated with independent Laplacian entries. We scale e such that 

f 500ei with probability 0.001; 



|6*|| = 0.2. The response vector is given by bi 



K + U otherwise. 



When making the experiments, we require c\ > 161. This implies that if we choose s/n > 0.01, 
and perform the uniform sampling, with probability at least 0.8, at least one row in the first block 
(associated with the first coordinate) will be selected, due to 1 — (1 — 0.01) 161 > 0.8. Hence, if we 
choose s > O.Olra, we may expect uniform sampling produce acceptable estimation. 



Real census data. For the real data, we consider a data set consisting of a 5% sample of the 
U.S. 2000 Census data!, consisting of annual salary and related features on people who reported 
that they worked 40 or more weeks in the previous year and worked 35 or more hours per week. 
The size of the design matrix is 5 x 10 6 by 11. 

The remainder of this section will consist of six subsections, the first five of which will show the 
results of experiments on the skewed data, and then Section 14.61 which will show the results on 
census data. In more detail, Section 14.11 14.21 14.31 an d 14.41 will summarize the performance of the 
methods in terms of solution quality as the parameters s, n, d, and r, respectively, are varied; and 
Section [4.51 will show how the running time changes as s, n, and d change. 

2 U.S. Census, [http:// www . census . gov/ census2000/PUMS5 . html 



12 



4.1 Quality of approximation when the sampling size s changes 

As discussed in Section [3.11 we can use one of several methods for the conditioning step, i.e., for 
finding the well-conditioned basis U = ART 1 in the Step 1 of Algorithm Here, we will consider 
the empirical performance of six methods for doing this conditioning step, namely: SC, SPC1, 
SPC2, SPC3, NOCO, and UNIF. The first four methods (SC, SPC1, SPC2, SPC3) are described in 
Section I3.lt NOCO stands for "no conditioning," meaning the matrix R in the conditioning step is 
taken to be identity; and, UNIF stands for the uniform sampling method, which we include here for 
completeness. Note that, for all the methods, we compute the row norms of the well-conditioned 
basis exactly instead of estimating them with Algorithm [H The reason is that this permits a cleaner 
evaluation of the quantile regression algorithm, as this may reduce the error due to the estimating 
step. We have, however, observed similar results if we approximate the row norms well. 

Rather than determining the sample size from a given tolerance e, we let the sample size s vary 
in a range as an input to the algorithm. Also, for a fixed data set, we will show the results when 
r = 0.5,0.75,0.95. In our figure, we will plot the first and the third quartiles of the relative errors 
of the objective value and solution measured in three different norms from 50 independent trials. 
We restrict the y axis in the plots to the range of [0, 100] to show more details. We start with a 
test on skewed data with size le6 x 50. (Recall that, by le6 x 50, we mean that n = 1 x 10 6 and 
d = 50.) The resulting plots are shown in Figure [TJ 

From these plots, if we look at the sampling size required for generating at least 1-digit accuracy, 
then SPC2 needs the fewest samples, followed by SPC3, and then SPC1. This is consistent with the 
order of the condition numbers of these methods. For SC, although in theory it has good condition 
number properties, in practice it performs worse than other methods. Not surprisingly, NOCO and 
UNIF are not reliable when s is very small, e.g., less than le4. 

When the sampling size s is large enough, the accuracy of each conditioning method is close to 
the others in terms of the objective value. Among these, SPC3 performs slightly better than others. 
When estimating the actual solution vectors, the conditioning-based methods behave substantially 
better than the two naive methods. SPC2 and SPC3 are the most reliable methods since they can 
yield the least relative error for every sample size s. NOCO is likely to sample the outliers, and UNIF 
cannot get accurate answer until the sampling size s > le4. This accords with our expectations. 
For example, when s is less than le4, as we pointed out in the remark below the description of 
the skewed data, it is very likely that none of the rows in the first block corresponding to the 
first coordinate will be selected. Thus, poor estimation will be generated due to the imbalanced 
measurements in the design matrix. Note that from the plots we can see that if a method fails with 
some sampling complexity s, then for that value of s the relative errors will be huge (e.g., larger 
than 100, which is clearly a trivial result). Note also that all the methods can generate at least 
1-digit accuracy if s is large enough. 

It is worth mentioning the performance difference among SPC1, SPC2 and SPC3. From Tabled! 
we show the tradeoff between running time and condition number for the three methods. As we 
pointed out, SPC2 always needs the least sampling complexity to generate 2-digit accuracy, followed 
by SPC3 and then SPC1. When s is large enough, SPC2 and SPC3 perform substantially better 
than SPC1. As for the running time, SPC1 is the fastest, followed by SPC3, and then SPC2. Again, 
all of these follow the theory about our SPC methods. We will present a more detailed discussion 
for the running time in Section 14.51 

Although our theory doesn't say anything about the quality of the solution vector itself (as 
opposed to the value of the objective function), we evaluate this here. To measure the approximation 
to the solution vectors, we use three norms (the l\, ti, and norms). From Figure [U we see 
that the performance among these method is qualitatively similar for each of the three norms, but 
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Figure 1: The first (solid lines) and the third (dashed lines) quartiles of the relative errors 
of the objective value (namely, \f — f*\/\f*\) and solution vector (measured in three different 
norms, namely, the ^2?-^i and norms), by using 6 different methods, among 50 independent 
trials. The test is on skewed data with size le6 by 50. The three different columns correspond to 
r = 0.5, 0.75, 0.95, respectively. 
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the relative error is higher when measured in the norm. In more detail, see Table [21 where 
we show the exact quartiles of the relative error on vectors for each methods for s = 5e4 and 
r = 0.75. Not surprisingly, NOCO and UNIF are not among the reliable methods when s is small 
(and they get worse when s is even smaller). Note that the relative error for each method doesn't 
change substantially when r takes different values. We present a more detailed discussion of the r 
dependence in Section 14.41 

(We note also that, for subsequent figures in subsequent subsections, we obtained similar qual- 
itative trends for the errors in the approximate solution vectors when the errors were measured in 
different norms. Thus, due to this similarity and to save space, in subsequent figures, we will only 
show errors for £2 norm.) 



SC 
SPC1 
SPC2 
SPC3 
NOCO 
UNIF 



[0.0121, 0.0172] [0.0093, 0.0122] [0.0229, 0.0426] 

[0.0108, 0.0170] [0.0081, 0.0107] [0.0198, 0.0415] 

[0.0079, 0.0093] [0.0061, 0.0071] [0.0115, 0.0152] 

[0.0094, 0.0116] [0.0086, 0.0103] [0.0139, 0.0184] 

[0.0447, 0.0583] [0.0315, 0.0386] [0.0769, 0.1313] 

[0.0396, 0.0520] [0.0287, 0.0334] [0.0723, 0.1138] 



Table 2: The first and the third quartiles of relative errors of the solution vector, measured in £1, 
£2, and £oo norms. The test data set is the skewed data, with size le6 x 50, the sampling size 
s = 5e4, and r = 0.75. 



4.2 Quality of approximation when the higher dimension n changes 

Next, we describe how the performance of our algorithm varies when higher dimension n changes. 
(We present the results when the lower dimension d changes in Section 14.31 ) Figures [2] and [3] 
summarize our results. 

Figure [2] shows the performance of the relative error of the objective value and solution vector 
by using the six different methods, as n is varied, for fixed values of r = 0.75 and d = 50. For 
each row, the three figures come from three data sets with n taking value in le5, 5e5, le6. (Recall 
that, in these experiments, we only list the plots showing the relative error on vectors measured 
in £2 norm. Since the plots for the £\ and £00 norm are similar, we omit them.) We see that, 
when d is fixed, the basic structure in the plots that we observed before is preserved when n takes 
three different values. In particular, the minimum sampling complexity s needed for each method 
for yielding high accuracy does not vary a lot. When s is large enough, the relative performance 
among all the methods is similar; and, when all the parameters are fixed except for n, the relative 
error for each method does not change quantitatively. 

We will also let n take a wider range of values. Figure [3] shows the change of relative error on the 
objective value and solution vector by using SPC3 and letting n vary from le4 to le6 and d = 50 
fixed. Recall, from Theorem [TJ that for given a tolerance e, the required sampling complexity s 
depends only on d. That is, if we fix the sampling size s and d, then the relative error should not 
vary much, as a function of n. If we inspect Figure [3l we see that the relative errors are almost 
constant as a function of increasing n, provided that n is much larger than s. When s is very 
close to n, since we are sampling roughly the same number of rows as in the full data, we should 
expect lower errors. Also, we can see that by using SPC3, relative errors remain roughly the same 
in magnitude. 
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Figure 2: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of 
the objective value (namely, \f — f*\/\f*\) and solution vector (namely, \\x — x*\\2/\\x*\\2), when 
n changes and d = 50 by using 6 different methods, among 50 independent trials. The test is on 
skewed data and r = 0.75. The three different columns correspond ton = le5, 5e5, le6, respectively. 
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Figure 3: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of 
the objective value (namely, \f — f*\/\f*\) and solution vector (namely, \\x — x*\\2/\\x*\\2), when n 
varying from le4 to le6 and d = 50 by using SPC3, among 50 independent trials. The test is on 
skewed data. The three different columns correspond to r = 0.5, 0.75, 0.95, respectively. 
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4.3 Quality of approximation when the lower dimension d changes 

Next, we describe how the overall performance changes when the lower dimension d changes. Fig- 
ures H] and [5] summarize our results. These figures show the same quantities that were plotted in 
the previous subsection, except that here it is the lower dimension d that is now changing, and the 
higher dimension n = le6 is fixed. In Figure HI we let d take values in 10, 50, 100, we set r = 0.75, 
and we show the relative error for all 6 conditioning methods. In Figure[5l we let d take more values 
in the range of [10, 100], and we show the relative errors by using SPC3 for different sampling sizes 
s and r values. 
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Figure 4: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of 
the objective value (namely, \f — f*\/\f*\) and solution vector (namely, ||x — x*||2/||x*||2), when d 
changes and n = le6 by using 6 different methods, among 50 independent trials. The test is on 
skewed data and r = 0.75. The three different columns correspond to d = 10, 50, 100, respectively. 

For Figure HI as d gets larger, the performance of the two naive methods do not vary a lot. 
However, this increases the difficulty for conditioning methods to yield 2-digit accuracy. When d 
is quite small, most methods can yield 2-digit accuracy even when s is not large. When d becomes 
large, SPC2 and SPC3 provide good estimation, even when s < 1000. The relative performance 
among these methods remains unchanged. For Figure El the relative errors are monotonically 
increasing for each sampling size. This is consistent with our theory that, to yield high accuracy, 
the required sampling size is a low-degree polynomial of d. 

4.4 Quality of approximation when the quantile parameter r changes 

Next, we will let r change, for a fixed data set and fixed conditioning method, and we will investigate 
how the resulting errors behave as a function of r. We will consider r in the range of [0.5,0.9], 
equally spaced by 0.05, as well as several extreme quantiles such as 0.975 and 0.98. We consider 
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Figure 5: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of 
the objective value (namely, \f — f*\/\f*\) and solution vector (namely, \\x — x*\\2/\\x*\\2), when d 
varying from 10 to 100 and n = le6 by using SPC3, among 50 independent trials The test is on 
skewed data. The three different columns correspond to r = 0.5, 0.75, 0.95, respectively. 
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skewed data with size le6 x 50; and our plots are shown in Figure [6j 
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Figure 6: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of 
the objective value, namely \f — f*\/\f*\, and solution vector, namely, \\x — x*\\2/\\x*\\2, when r 
varying from 0.5 to 0.999 by using SPC1, SCP2, SPC3, among 50 independent trials. The test is 
on skewed data with size le6 by 50. Within each plot, three sampling sizes are considered, namely, 
le4, le4, le5. 

The plots in Figure [6] demonstrate that, given the same method and sampling size s, the 
relative errors are monotonically increasing but only very gradually, i.e., they do not change very 
substantially in the range of [0.5,0.95]. On the other hand, all the methods generate high relative 
errors when r takes extreme values very near 1 (or 0). Overall, SPC2 and SPC3 performs better 
than SPC1. Although for some quantiles SPC3 can yield slightly lower errors than SPC2, it too 
yields worst results when r takes on extreme values. 

4.5 Evaluation on running time performance 

In this subsection, we will describe running time issues, with an emphasis on how the running time 
behaves as a function of s, d and n. 

When the sampling size s changes 

To start, Figure [7] shows the running time for computing three subproblems associated with three 
different r values by using six methods (namely, SC, SPC1, SPC2, SPC3, NOCO, UNIF) when the 
sampling size s changes. (This is simply the running time comparison for all the six methods used 
to generate Figure [U) As expected, the two naive methods (NOCO and UNIF) run faster than 
other methods in most cases — since they don't perform the additional step of conditioning. For 
s < 10 4 , among the conditioning-based methods, SPC1 runs fastest, followed by SPC3 and then 
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SPC2. As s increases, however, the faster methods, including NOCO and UNIF, become relatively 
more expensive; and when s ~ 5e5, all of the curves, except for SPC1, reach almost the same point. 

To understand what is happening here, recall that we accept the sampling size s as an input 
in our algorithm; and we then construct our sampling probabilities by pi = min{l,s • Aj/^Aj}, 
where Aj is the estimation of the l\ norm of the i-th row of a well-conditioned basis. (See Step 4 
in Algorithm (H) Hence, the s is not the exact sampling size. Indeed, upon examination, in this 
regime when s is large, the actual sampling size is often much less than the input s. As a result, 
almost all the conditioning-based algorithms are solving a subproblem with size, say, s/2 x d, while 
the two naive methods are are solving subproblem with size about s x d. The difference of running 
time for solving problems with these sizes can be quite large when s is large. For conditioning- 
based algorithms, the running time mainly comes from the time for conditioning and solving the 
subproblem. Thus, since SPC1 needs the least time for conditioning, it should be clear why SPC1 
needs much less time when s is very large. 



The running time for each method 

10 2 F 




sample size 



Figure 7: The running time for solving the three problems associated with three different r values 
by using six methods, namely, SC, SPC1, SPC2, SPC3, NOCO, UNIF, when the sampling size s 
changes. 

When the higher dimension n changes 

Next, we compare the running time of our method with some competing methods when data size 
increases. The competing methods are the primal-dual method, referred to as ipm, and that with 
preprocessing, referred to as prqf n; see |14j for more details on these two methods. 

We let the large dimension n increase from le5 to le8, and we fix s = 5e4. For completeness, 
in addition to the skewed data, we will consider two additional data sets. First, we also consider a 
design matrix with entries generated from i.i.d. Gaussian distribution, where the response vector is 
generated in the same manner as the skewed data. Also, we will replicate the census data 20 times 
to obtain a data set with size le8 by 11. For each n, we extract the leading nx d submatrix of the 
replicated matrix, and we record the corresponding running time. The results of running time on 
all three data sets are shown in Figure El 
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Figure 8: The running time for five methods (ipm, prqfn, SPC1, SPC2 and SPC3) on the same 
data set, with d fixed and n changing. The sampling size s = 5e4, and the three columns correspond 
to t = 0.5,0.75,0.95, respectively. 
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From the plots in Figure [8]we see, SPC1 runs faster than any other methods across all the data 
sets, in some cases significantly so. SPC2, SPC3 and prqfn perform similarly in most cases, and 
they appear to have a linear rate of increase. Also, the relative performance between each method 
does not vary a lot as the data type changes. 

Notice that for the skewed data, when d = 50, SPC2 runs much slower than when d = 10. The 
reason for this is that, for conditioning-based methods, the running time is composed of two parts, 
namely, the time for conditioning and the time for solving the subproblem. For SPC2, an ellipsoid 
rounding needs to be applied on a smaller data set whose larger dimension is a polynomial of d. 
When the sampling size s is small, i.e., the size of the subproblem is not too large, the dominant 
running time for SPC2 will be the time for ellipsoid rounding, and as d increase (by, say, a factor 
of 5) we expect a worse running time. Notice also that, for all the methods, the running time does 
not vary a lot when r changes. Finally, notice that all the conditioning-based methods run faster 
on skewed data, especially when d is small. The reason is that the running time for these three 
methods is of the order of input-sparsity time, and the skewed data are very sparse. 



When the lower dimension d changes 

Finally, we will describe the scaling of the running time as the lower dimension d changes. To 
do so, we fixed n = le6 and the sampling size s = le4. We let all five methods run on the data set 
with d varying from 5 up to 180. When d ~ 200, the scaling was such that all the methods except 
for SPC1 and SPC3 became too expensive. Thus, we let only SPC1 and SPC3 run on additional 
data sets with d up to 270. The plots are shown in Figure 
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Figure 9: The running time for five methods (ipm, prqfn, SPC1, SPC2, and SPC3) for solving 
skewed data, with n = le6, when d varies. SPC1 and SPC3 show better scaling than other 
methods when d < 180. For this reason, we keep running the experiments for SPC1 and SPC3 
until d = 270. When d < 100, the three conditioning-based methods can yield 2-digit accuracy. 
When for d £ [100, 180], they can yield 1-digit accuracy. 

From the plots in Figured we can see that when d < 180, SPC1 runs significantly faster than 
any other method, followed by SPC3 and prqfn. The performance of prqfn is quite variable. The 
reason for this is that there is a step in prqfn that involves uniform sampling, and the number of 
subproblems to be solved in each time might vary a lot. The scalings of SPC2 and ipm are similar, 
and when d gets much larger, say d > 200, they may not be favorable due to the running time. 
When d < 180, all the conditioning methods can yield at least 1-digit accuracy. Although one 
can only get an approximation to the true solution by using SPC1 and SPC3, they will be a good 
choice when d gets even larger, say up to several hundred, as we shown in Figure [9j We note that 
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we could let d get even larger for SPC1 and SPC3, demonstrating that SPC1 and SPC3 is able to 
run with a much larger lower dimension than the other methods. 

Remark. One may notice a slight but sudden change in the running time for SPC1 and SPC3 at 
d ~ 130. After we traced down the reason, we found out that the difference come from the time in 
the conditioning step (since the subproblems they are solving have similar size), especially the time 
for performing the QR factorization. At this size, it will be normal to take more time to factorize 
a slightly smaller matrix due to the structure of cache line, and it is for this reason that we see 
that minor decrease in running time with increasing d. We point out that the running time of our 
conditioning-based algorithm is mainly affected by the time for the conditioning step. That is also 
the reason why it does not vary a lot when r changes. 

4.6 Evaluation on solution of Census data 

Here, we will describe more about the accuracy on the census data when SPC3 is applied to it. 
The size of the census data is roughly 5e6 x 11. 

We will generate plots that are similar to those appeared in [9]. For each coefficient, we will 
compute a few quantities of it, as a function of r, when r varies from 0.05 to 0.95. We compute 
a point-wise 90 percent confidence interval for each r by bootstrapping. These are shown as the 
shaded area in each subfigure. Also, we compute the quartiles of the approximated solutions by 
using SPC3 from 200 independent trials with sampling size s = 5e4 to show how close we can get 
to the confidence interval. In addition, we also show the solution to Least Square regression (LS) 
and Least Absolute Deviations regression (LAD) on the same problem. The plots are shown in 
Figure QUI 

From these plots we can see that, although the two quartiles are not inside the confidence 
interval, they are quite close, even for this value of s. The sampling size in each trial is only 5e4 
which is about 1 percent of the original data; while for bootstrapping, we are resampling the same 
number of rows as in the original matrix with replacement. In addition, the median of these 50 
solutions is in the shaded area and close to the true solution. Indeed, for most of the coefficients, 
SPC3 can generate 2-digit accuracy. Note that we also computed the exact values of the quartiles; 
we don't present them here since they are very similar to those in TableHJbelow in terms of accuracy. 
See Table H] in Section [5] for more details. All in all, SPC3 performs quite well on this real data. 

5 Empirical Evaluation on Large-scale Quantile Regression 

In this section, we continue our empirical evaluation with an evaluation of our main algorithm 
applied to terabyte-scale problems. Here, the data sets are generated by "stacking" the medium- 
scale data a few thousand times. Although this leads to "redundant" data, which may favor 
sampling methods, this has the advantage that it leads terabyte-sized problems whose optimal 
solution at different quantiles are known. At this terabyte scale, ipm has two major issues: memory 
requirement and running time. Although shared memory machines with more than a terabyte 
RAM exist, they are rare in practice (now in 2013). Instead, the MapReduce framework is the 
de facto standard parallel environment for large data analysis. Apache HadoopH, an open source 
implementation of MapReduce, is widely-used in practice. Since our sampling algorithm only needs 
several passes through the data and it is embarrassingly parallel, it is straightforward to implement 
it on Hadoop. 

3 Apache Hadoop, http : / /hadoop . apache . org/ 
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Figure 10: Each subfigure is associated with a coefficient in the census data. The shaded area shows 
a point-wise 90% confidence interval. The black curve inside is the true solution when r changes 
from 0.05 to 0.95. The blue and green lines correspond to the £2 and l\ solution, respectively. The 
two magenta curves show the first and third quartiles of solutions obtained by using SPC3, among 
200 independent trials with sampling size s = 5e4 (about 1% of the original data). 
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For a skewed data with size 5e6 x 10, we stack it vertically 2200 times. This leads to a data with 
size l.lelO x 10. In order to show the evaluations similar to Figure [H we implement SC, SPC1, 
SPC3, NOCO and UNIF. Figure [TTI shows the relative errors on the replicated skewed data set by 
using the five methods. These plots correspond with and should be compared to the first two rows 
in Figure [TJ 
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Figure 11: The first (solid lines) and the third (dashed lines) quartiles of the relative errors of the 
objective value (namely, \f — f*\/\f*\) and solution vector (namely, ||x — x*||2/||x*||2), by using 6 
different methods, among 50 independent trials. The test is on replicated skewed data with size 
l.lelO by 10. The three different columns correspond to r = 0.5,0.75,0.95, respectively. 

As can be seen, the method preserves the same structure as when the method is applied to the 
medium-scale data. In this particular case, however, since d is smaller, almost all the conditioning- 
based methods yield 2-digit accuracy when s is small, and SPC3 performs slightly better than other 
methods when s is large enough. In this case, as before, NOCO and UNIF are not reliable when 
s < le4. Also, as r gets larger, the relative errors may become somewhat larger; this is especially 
obvious for the performance of the SC method. 

In order to show more detail on the quartiles of the relative errors, we generated a table similar 
to Table [2] which records the quartiles of relative errors on vectors, measured in l\, £2, and norms 
by using the five methods (here, we did not implement SPC2) when the sampling size s = 5e5 and 
r = 0.75. Table [3] shows similar quantities to and should be compared with Table [2j Conditioning- 
based methods can yield 2-digit accuracy when s = le4 while NOCO and UNIF cannot. Also, the 
relative error is somewhat higher when measured m too norm. 

For the census data, we stack it vertically 2000 times to construct a realistic data set whose size 
is roughly lelO x 11. In Tabled! we present the solution computed by our randomized algorithm 
with a sample size le5 at different quantiles, along with the corresponding optimal solution. As 
can be seen, for most coefficients, our algorithm provides at least 2-digit accuracy. Moreover, 
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Table 3: The first and the third quartiles of relative errors of the solution vector, measured in £\, 
£2, and £oo norms. The test data set is replicated skewed data, with size le6 x 50, the sampling 
size s = 5e5, and r = 0.75. 



in applications such as this, the quantile regression result reveals some interesting facts about 
these data. For example, for these data, marriage may entail a higher salary in lower quantiles; 
Education 2 , whose value ranged from to 256, has a strong impact on the total income, especially 
in the higher quantiles; and the difference in age doesn't affect the total income much in lower 
quantiles, but becomes a significant factor in higher quantiles. 
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[ -0.0179, -0.0117] 


[-0.0200, -0.0149] 


[-0.0225, -0.0189] 


[-0.0500, -0.0448] 


[-0.1112, -0.1032] 


0.0057 


0.0062 


0.0065 


0.0081 


0.0119 


[0.0055, 0.0058] 


[0.0061, 0.0064] 


[0.0064, 0.0066] 


[0.0080, 0.0083] 


[0.0117, 0.0122] 



INTERCEPT 
FEMALE 

Age e [30, 40) 
Age e [40, 50) 
Age g [50, 60) 
Age 6 [60, 70) 
Age > 70 
non_white 

MARRIED 
EDUCATION 
EDUCATION 2 



Table 4: Quantile regression results for the U.S. Census 2000 data. The response is the total annual 
income. Except for the intercept and the terms involved with education, all the covariates are {0, 1} 
binary indicators. 

To summarize our large-scale evaluation, our main algorithm can handle terabyte-sized quantile 
regression problems easily, obtaining, e.g., 2 digits of accuracy by sampling about le5 rows on a 
problem of size lelO x 11. In addition, its running time is competitive with the best existing random 
sampling algorithms, and it can be applied in parallel and distributed environments. 



6 Conclusion 

We have proposed, analyzed, and evaluated new randomized algorithms for solving medium-scale 
and large-scale quantile regression problems. Our main algorithm uses a subsampling technique 
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that involves constructing an £i-well-conditioned basis; and our main algorithm runs in nearly 
input-sparsity time, plus the time needed for solving a subsampled problem whose size depends 
only on the lower dimension of the design matrix. The sampling probabilities used by our main 
algorithm are derived by calculating the t\ norms of a well-conditioned basis; and this conditioning 
step is an essential step of our method. For completeness, we have provided a summary of recently- 
proposed l\ conditioning methods, and based on this we have introduced a new method (SPC3) in 
this article. 

We have also provided a detailed empirical evaluation of our main algorithm. This evaluation 
includes a comparison in terms of the quality of approximation of several variants of our main 
algorithm that are obtained by applying several different conditioning methods. The empirical 
results meet our expectation according to the theory. Most of the conditioning methods, like 
our proposed method, SPC3, yield 2-digit accuracy by sampling only 0.1% of the data on our 
test problem. As for running time, our algorithm is more scalable, when comparing to existing 
competing algorithms, especially when the lower dimension gets up to several hundred, while the 
large dimension is at least one million. In addition, we show that our algorithm works well for 
terabytes-size data in terms of accuracy and solvability. 

Finally, we should emphasize that our main algorithm relies heavily on the notion of l\ condi- 
tioning, and that the overall performance of it can be improved if better l\ conditioning methods 
are derived. 

A Proof of Lemma [8] 

By LemmaEl in Step 1, IT is a low-distortion embedding satisfying Eqn. ([5]) with k\K2 = 0(d 3 log 3 d), 
and r\ = O (d 5 log 5 d) . As a matter of fact, as we discussed in Section [331 the resulting AR^ 1 in 
Step 2 is a well-conditioned basis with k = 0(d~^ log~ d). In Step 3, by Lemma El the sampling 

15 11 

complexity required for obtaining a (1 ± l/2)-distortion sampling matrix is s = 0(d~ log~2~ d). 
Finally, if we view S as a low-distortion embedding matrix with r = s and K2^i = 3, then the 
resulting R in Step 4 will satisfy that AR~ l is a well-conditioned basis with k = 0(d~i log~ d). 

For the running time, it takes 0(nnz(A)) time for completing Step 1. In Step 2, the running 
time is r\d 2 = poly(<i). As Lemma [3] points out, the running time for constructing S in Step 3 
is 0(nnz(^4) • logn). Since the large dimension of S A is a low-degree polynomial of d, the QR 
factorization of it costs sd 2 = poly(<i) time in Step 4. Overall, the running time of Algorithm [3] is 
C(nnz(yl) • log n). 

B Proof of Lemma [9] 

Since U is a well-conditioned basis for the range space of A, to prove (JSj) it is equivalent to prove 
the following holds for all y £ M. d , 

(1 - e)p T {Uy) < p T (SUy) < (1 + e)p T (Uy). (10) 

To prove that (|10p holds for any y £ R d , firstly, we show that (|10p holds for any fixed y € M. d ; and, 
secondly, we apply a standard 7- net argument to show that (fTUj) holds for every y 6 M. d . 

Assume that U is (a, /3)-conditioned with k = aft. For i £ [n], let Vi = U^y. Then p T (SUy) = 
J2ie[n] PriSnVi) = Yji£[n] SiiP T {vi) since Su > 0. Let Wi = Sup T (vi) - p T (vi) be a random variable, 
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and we have 



W; 



(f ~ 1 )Pt{vi), with probability pi; 



-Pr(Vi), 



with probability 1 — pi 



Therefore, E[w;j] = 0, Var[u>j] = (4- — l)p T (vi) 2 ,\uti\ < j-p T (vi)- Note here we only consider i 
such that \\U^ ||i/||?7||i < 1 since otherwise we have pi = 1, and the corresponding term will not 
contribute to the variance. According to our definition, pi > s ■ \\Ut^ \\i/\\U\\i = s ■ ij. Consider the 
following, 

Pr{vi) = p T {U(i)y) < T\\U {i) y\\i < r||(C/( i ))||i||y|| 00 . 

Hence, 

11 r 

\Wi\ < —Pr{Vi) < — T K7(j) i \\y oo < - !7 l J/ a, 



Pi 

1 r 



P; 



< -- 



s 1 — r 



af3p T {Uy) := M. 



Also, 



^ VarH < ^p r K) 2 < Mp T {Uy). 

■,-r l -,-r l P* 



«G[nJ iG[n] 

Applying the Bernstein inequality to the zero-mean random variables Wi gives 



Pr 



ien 



II', 



> e 



< 2exp 



2"£ i Var[w i ] + iMe 



Since X^ie[n] ^ = Pr{SUy) — p T (Uy), setting e to ep T (Uy) and plugging all the results we derive 
above, we have 



Vx[\p T {SUy) - p T (Uy)\ > ep T {Uy)} < 2exp 



e 2 p 2 (Uy) 



2Mp T (Uy) + ^Mp T (Uy) 



Let's simplify the exponential term on the right hand side of the above expression: 



e 2 p 2 T (Uy) 



-se 2 1 - r 1_ 

2Mp T (Uy) + 'fMp T (Uyj ~ ~afi ~ 2T~ 



< 



-se z 1 - r 
3a/3 r 



Therefore, when s > 



27af3 



dlog ( | J + log (|) J , with probability at least 1 - (j/3) d 5/2, 



1 T I \ 7 

(1 - e/3)pr(Uy) < Pr(SUy) < (1 + e/3)p T (Uy), 



where 7 will be specified later. 

We will show that, for all z 6 range(£7), 



(1 - e)p r (z) < p T (Sz) < (1 + e)p T (z). 



(11) 



(12) 



By the positive linearity of p T (-), it suffices to show the bound above holds for all z with \\z\\i = 1. 

Next, let Z = {z£ range (i7) | < 1} and construct a 7-net of Z, denoted by Z~, such that 
for any z E Z, there exists a z 7 £ Z 7 that satisfies ||.z — -z 7 ||i < 7. By [1], the number of elements 
in Zj is at most (3/j) d . Hence, with probability at least 1 — 5/2, (jlip holds for all z 7 G 

We claim that, with suitable choice 7, with probability at least 1 — 5/2, S will be a (1 ± 2/3)- 
distortion embedding matrix of (A, || • ||i). To show this, firstly, we state a similar result for || • ||i 
from Theorem 6 in [6] with p = 1 as follows. 
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Lemma 11 {l\ Subspace-preserving Sampling Lemma). Given A G R nxd , let U G M nxd be an 
(a, f3)- conditioned basis for A. For s > 0, define 

Pi > min{l,s • ||f7(t)||i/||f7||i}, 

and let S G M. nxn be a random diagonal matrix with Sa = \/p% with probability pi, and otherwise. 
Then when e < 1/2 and 



with probability at least 1 — 5, for every x G M. d , 

(1 - e)\\Ax\\i < \\SAx\\i < (1 + (13) 

Note here we change the constraint e < 1/7 and the original theorem to e < 1/2 above. One can 
easily show that the result still holds with such setting. If we set e = 2/3 and the failure probability 
to be at most 5/2, the construction of S defined above satisfies conditions of Lemma ITT1 when the 
expected sampling complexity s > s := 72a(3 (d log (18) + log (f)). Then our claim for S holds. 
Hence we only need to make sure with suitable choice of 7, we have s > s. 

For any z with = 1, we have 

\p T (Sz) - Pr(z)\ < \Pt(Sz) - p T (Sz^)\ + \p T (SZy) - p T (Zy)\ + \Pt(Zj) ~ Pr{z)\ 

< t\\S(z - 2 7 )||i + (e/3)p r (z 7 ) + r||z 7 - z\\i 

< t\\\S(z - Zj)\\i -\\(z- z 7 )||i| + (e/3)p T (z) + (e/3)p T (z 7 - z) + 2r||z 7 - z\\\ 

< 2t/3\\z — z 7 ||i + (e/S)p T (z) + r(e/3)||z 7 — z\\\ + 2r||z 7 — z\\i 

< {e/S)p T (z) + r 7 (2/3 + e/3 + 2) 

^ ( e / 3 + 7377(2/3 + e/3 + 2)Jp r (2r) 

< ePr(^), 

where we take 7 = -^jr-e, and the expected sampling size becomes 



1 — re 2 \ \l — t e J \6 

When e < 1/2, we will have s > s. Hence the claim for S holds and f)12[) holds for every z G 
range(f7). 

Since the proof is involved with two random events with failure probability at most 5/2, by a 
simple union bound, (jlQf) holds with probability at least 1 — 5. Our results follows since k = a/3. 



C Proof of Lemma [10 

In this lemma, slightly different from the previous notation, we will use s and s to denote the actual 
number of rows selected and the input parameter for defining the sampling probability, respectively. 
From Lemma [U a (1 ± e)-distortion sampling matrix 5 could be constructed by calculating the i\ 
norms of the rows of AR~ l . Indeed, we will estimate these row norms and adjust the sampling 
complexity s. According to Lemma 12 in [5], with probability at least 0.95, the Aj,i G [n] we 
compute in the first three steps of Algorithm H] satisfy 

1 3 
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where U = AR . Conditioned on this high probability event, we set & > min \ 1, s ■ t-* — r- \. 

Then we will have p\ > min 1 1 , | • ^'j^ 1 1 . Since s /3 satisfies the sampling complexity required 
in Lemma [9] with 5 = 0.05, the corresponding sampling matrix S is constructed as desired. These 
are done in Step 4 and Step 5. Since the algorithm involves two random events, by a simple union 
bound, with probability at least 0.9, S is a (1 ± e)-distortion sampling matrix. 

By the definition of sampling probabilities, E[s] = Ylie[n] Pi — Note here s is the sum of some 
random variables and it is tightly concentrated around its expectation. By a standard Bernstein 
bound, with probability 1 — o(l), s < 2s = O (pudlog (p/e) /e 2 ), where p = as claimed. 

Now let's compute the running time in Algorithm HI The main computational cost comes from 
Steps 2, 3 and 5. The running time in other steps will be dominated by it. It takes d 2 T2 time to 
compute i?~ 1 ri2; then it takes C(nnz( J 4) • T2) time to compute AR~ 1 Ti2] and finally it takes 0(n) 
time to compute all the Aj and construct S. Since T2 = O(logra), in total, the running time is 
0((d 2 +nnz(A)) log n + n) = C(nnz(A) -logn). 

D Proof of Theorem [I] 

In Step 1, by Lemma [71 the matrix R € M. dxd computed by Algorithm [2] satisfies that with proba- 
bility at least 0.9, AR -1 is a well-condition basis for A with n = 6d 2 . The probability bound can 
be attained by setting the corresponding constants sufficiently large. In Step 2, when we apply 
Algorithm 0] to construct the sampling matrix S, by Lemma [TUl with probability at least 0.9, S 
will be a (1 ± e)-distortion sampling matrix of (A,p T (-)). Solving the subproblem min xg c p T (SAx) 
gives a (1 + e)/(l — e) solution to the original problem ([3]). This is because 

Pr(Ax) < -^—p T {SAx) < J—p r (SAx*) < ±±^p T (Ax*), (14) 

where the first and third inequalities come from ([9]) and the second inequality comes from the fact 
that x is the minimizer of the subproblem. Hence the solution x returned by Step 3 satisfies our 
claim. The whole algorithm involves two random events, the overall success probability is at least 
0.8. 

Now let's compute the running time for Algorithm [5j In Step 1, by Lemma [7J the running 
time for Algorithm [2] to compute R is C(nnz^4). By Lemma PT0"| the running for Step 2 is 
C(nnz(yl) - logn). Furthermore, as stated in Lemma 1101 and k^AR^ 1 ) = 2d 2 , with probability 
1 — o(l), the actual sampling complexity is O (pd 3 log (p/e) /e 2 ) , where p = t/(1 — r), and it takes 
4> (O (pd 3 log (/u/e) /e 2 ) , d) time to solve the subproblem in Step 3. This follows the overall running 
time of Algorithm [5] as claimed. 
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